Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
Large-scale text-to-image generative models have been a revolutionary
breakthrough in the evolution of generative AI, allowing us to synthesize
diverse images that convey highly complex visual concepts. However, a pivotal
challenge in leveraging such models for real-world content creation tasks is
providing users with control over the generated content. In this paper, we
present a new framework that takes text-to-image synthesis to the realm of
image-to-image translation -- given a guidance image and a target text prompt,
our method harnesses the power of a pre-trained text-to-image diffusion model
to generate a new image that complies with the target text, while preserving
the semantic layout of the source image. Specifically, we observe and
empirically demonstrate that fine-grained control over the generated structure
can be achieved by manipulating spatial features and their self-attention
inside the model. This results in a simple and effective approach, where
features extracted from the guidance image are directly injected into the
generation process of the target image, requiring no training or fine-tuning
and applicable to both real and generated guidance images. We demonstrate
high-quality results on versatile text-guided image translation tasks,
including translating sketches, rough drawings and animations into realistic
images, changing the class and appearance of objects in a given image, and
modifying global qualities such as lighting and color.
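The injection idea described in this abstract can be illustrated with a toy example. The following is only a minimal sketch, not the authors' implementation: it assumes a single self-attention layer operating on small lists of token vectors, and all function names (`attention_map`, `self_attention`, the `injected_attn` parameter) are hypothetical. It shows how an attention map computed from guidance-image features could replace the map the target generation would otherwise compute.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query token."""
    d = len(queries[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]


def apply_attention(attn, values):
    """Weighted sum of value vectors for every row of the attention map."""
    dim = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in attn
    ]


def self_attention(queries, keys, values, injected_attn=None):
    """Toy self-attention layer (hypothetical, for illustration only).

    If injected_attn is given (e.g. a map computed from guidance-image
    features), it overrides the map computed from the target's own
    queries and keys; the target's values are still used.
    """
    attn = injected_attn if injected_attn is not None else attention_map(queries, keys)
    return apply_attention(attn, values)
```

In this sketch, one would call `attention_map` on the guidance features, then pass the result as `injected_attn` when processing the target features, so that the target's appearance (values) is arranged according to the guidance layout (attention weights).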
SceneScape: Text-Driven Consistent Scene Generation
We present a method for text-driven perpetual view generation -- synthesizing
long-term videos of various scenes given only an input text prompt
describing the scene and camera poses. We introduce a novel framework that
generates such videos in an online fashion by combining the generative power of
a pre-trained text-to-image model with the geometric priors learned by a
pre-trained monocular depth prediction model. To tackle the pivotal challenge
of achieving 3D consistency, i.e., synthesizing videos that depict
geometrically plausible scenes, we deploy online test-time training to
encourage the predicted depth map of the current frame to be geometrically
consistent with the synthesized scene. The depth maps are used to construct a
unified mesh representation of the scene, which is built progressively
during the video generation process. In contrast to previous works, which are
applicable only to limited domains, our method generates diverse scenes, such
as walkthroughs in spaceships, caves, or ice castles.
Project page: https://scenescape.github.io
Revealing and modifying non-local variations in a single image
We present an algorithm for automatically detecting and visualizing small non-local variations between repeating structures in a single image. Our method can automatically correct these variations, producing an 'idealized' version of the image in which the resemblance between recurring structures is stronger. Alternatively, it can magnify these variations, producing an exaggerated image that highlights differences that are difficult to spot in the input image. We formulate the estimation of deviations from perfect recurrence as a general optimization problem, and demonstrate it in the particular cases of geometric deformations and color variations.
Israel Science Foundation (Grant 931/14), Shell Researc
MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation
Recent advances in text-to-image generation with diffusion models present
transformative capabilities in image quality. However, user controllability of
the generated image and fast adaptation to new tasks remain open
challenges, currently addressed mostly by costly and lengthy re-training and
fine-tuning or ad-hoc adaptations to specific image generation tasks. In this
work, we present MultiDiffusion, a unified framework that enables versatile and
controllable image generation, using a pre-trained text-to-image diffusion
model, without any further training or fine-tuning. At the center of our
approach is a new generation process, based on an optimization task that binds
together multiple diffusion generation processes with a shared set of
parameters or constraints. We show that MultiDiffusion can be readily applied
to generate high quality and diverse images that adhere to user-provided
controls, such as desired aspect ratio (e.g., panorama), and spatial guiding
signals, ranging from tight segmentation masks to bounding boxes. Project
webpage: https://multidiffusion.github.i
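One way to picture how several diffusion processes can be bound together over a shared canvas is sketched below. This is a simplified illustration under stated assumptions, not the paper's algorithm: the canvas is one-dimensional, `denoise_window` is a hypothetical stand-in for one denoising step of a pre-trained model on a single crop, and overlapping predictions are reconciled by a plain per-position average (the least-squares compromise when all windows are weighted equally).

```python
def window_starts(length, width, stride):
    """Start indices of sliding windows that together cover [0, length)."""
    if width > length:
        raise ValueError("window width must not exceed canvas length")
    starts = list(range(0, length - width + 1, stride))
    if starts[-1] != length - width:
        starts.append(length - width)  # extra window so the tail is covered
    return starts


def fuse_step(canvas, width, stride, denoise_window):
    """One fused update of a shared canvas (hypothetical sketch).

    Each window is processed independently by denoise_window, and every
    canvas position takes the mean of all window predictions covering it.
    """
    total = [0.0] * len(canvas)
    count = [0] * len(canvas)
    for start in window_starts(len(canvas), width, stride):
        prediction = denoise_window(canvas[start:start + width])
        for offset, value in enumerate(prediction):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]
```

Repeating `fuse_step` once per diffusion timestep would keep all windows in agreement on their overlaps, which is the intuition behind generating a wide panorama with a model that only sees fixed-size crops.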
TokenFlow: Consistent Diffusion Features for Consistent Video Editing
The generative AI revolution has recently expanded to videos. Nevertheless,
current state-of-the-art video models are still lagging behind image models in
terms of visual quality and user control over the generated content. In this
work, we present a framework that harnesses the power of a text-to-image
diffusion model for the task of text-driven video editing. Specifically, given
a source video and a target text-prompt, our method generates a high-quality
video that adheres to the target text, while preserving the spatial layout and
motion of the input video. Our method is based on a key observation that
consistency in the edited video can be obtained by enforcing consistency in the
diffusion feature space. We achieve this by explicitly propagating diffusion
features based on inter-frame correspondences, readily available in the model.
Thus, our framework does not require any training or fine-tuning, and can work
in conjunction with any off-the-shelf text-to-image editing method. We
demonstrate state-of-the-art editing results on a variety of real-world videos.
Webpage: https://diffusion-tokenflow.github.io
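The propagation step mentioned in this abstract can be sketched in a heavily simplified form. The code below is an assumption-laden toy, not the published method: it uses a single keyframe, nearest-neighbour matching under squared Euclidean distance, and hypothetical names (`nearest_index`, `propagate`). Each token of a frame is matched to its closest token in the keyframe's original features, and the edited keyframe feature at that position is copied over.

```python
def nearest_index(token, candidates):
    """Index of the candidate closest to token (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(candidates)), key=lambda i: sqdist(token, candidates[i]))


def propagate(frame_src, key_src, key_edited):
    """Propagate edited keyframe features to another frame (toy sketch).

    frame_src:  original features of the frame to edit, one vector per token
    key_src:    original features of the keyframe
    key_edited: edited features of the keyframe, aligned with key_src
    Correspondences are computed on the original features only, so tokens
    that matched before the edit receive the same edited feature.
    """
    return [key_edited[nearest_index(token, key_src)] for token in frame_src]
```

Because the correspondences come from the unedited features, two frames showing the same content map to the same edited features, which is the sense in which consistency in feature space carries over to the edited video.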
MoSculp: Interactive Visualization of Shape and Time
We present a system that allows users to visualize complex human motion via
3D motion sculptures---a representation that conveys the 3D structure swept by
a human body as it moves through space. Given an input video, our system
computes a motion sculpture and provides a user interface for rendering it
in different styles, including the options to insert the sculpture back into
the original video, render it in a synthetic scene or physically print it.
To provide this end-to-end workflow, we introduce an algorithm that estimates
the human's 3D geometry over time from a set of 2D images and develop a
3D-aware image-based rendering approach that embeds the sculpture back into the
scene. By automating the process, our system takes motion sculpture creation
out of the realm of professional artists, and makes it applicable to a wide
range of existing video material.
By providing viewers with 3D information, motion sculptures reveal space-time
motion information that is difficult to perceive with the naked eye, and allow
viewers to interpret how different parts of the object interact over time. We
validate the effectiveness of this approach with user studies, finding that our
motion sculpture visualizations are significantly more informative about motion
than existing stroboscopic and space-time visualization methods.
UIST 2018. Project page: http://mosculp.csail.mit.edu